fix: add retry logic for one time instruction #226
base: main
Conversation
force-pushed from 1d3de06 to 15b7568
force-pushed from 15b7568 to 845d8b0
I'm not sure this is the right approach. If an instruction fails, I would still expect the failure count to be updated in the secret before another attempt is made.
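A minimal sketch of that expectation, assuming a hypothetical "failure-count" key and an update callback rather than the agent's real helpers:

```go
package sketch

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// reattemptWithPersistedFailures illustrates the expectation above: the
// failure count is written back to the secret before any further attempt
// is made, so a restart mid-retry still sees an accurate failure history.
// The "failure-count" key and the update callback are assumptions, not
// the agent's actual names.
func reattemptWithPersistedFailures(secret *corev1.Secret, apply func() error, update func(*corev1.Secret) error) error {
	if err := apply(); err != nil {
		count, _ := strconv.Atoi(string(secret.Data["failure-count"]))
		secret.Data["failure-count"] = []byte(strconv.Itoa(count + 1))
		// Persist the incremented count before returning the error,
		// ahead of any retry the caller may schedule.
		if updateErr := update(secret); updateErr != nil {
			return updateErr
		}
		return err
	}
	return nil
}
```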
Have we looked into potentially resetting the appliedChecksumKey field in the secret after the system agent restarts and reattempts the instruction? It may be that this needs to be the general behavior, as opposed to plan/Windows-specific (which it currently is, IIRC).
force-pushed from 845d8b0 to dcb804b
@HarrisonWAffel, I made the changes per your suggestion. Can you please take a look?
Non-functional change request: discussion on edge cases.
@@ -215,10 +215,10 @@ func (w *watcher) start(ctx context.Context, strictVerify bool) {
	logrus.Infof("Detected first start, force-applying one-time instruction set")
	needsApplied = true
	hasRunOnce = true
	secret.Data[appliedChecksumKey] = []byte("")
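For the mechanism at play here: clearing the stored checksum makes the next comparison against the computed plan checksum fail, so the one-time instruction set is re-applied. A sketch in spirit, not the actual watcher logic:

```go
package sketch

// needsApply mirrors why the line above forces a re-apply: an empty
// stored checksum can never match a freshly computed plan checksum, so
// the instruction set is treated as not yet applied. The function name
// and comparison are illustrative only.
func needsApply(storedChecksum, computedChecksum string) bool {
	return storedChecksum != computedChecksum
}
```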
@HarrisonWAffel I believe this will work, but I think we need to address ResetFailureCountOnStartup if we are changing the default behavior. Do you think we would be fine removing it from the plan and then removing it from Rancher? What about re-running failed plans at startup? Asking because these seem like mostly Windows cases, so I need a refresher on the context (I remember there is some interesting behavior regarding potentially cyclic/competing services).

There is also the fact that this introduces a change to the system-agent-upgrader plan, which will re-run the latest plan during the upgrade. Not a bug, but definitely worth noting, since plans may be running during agent upgrades (already a possibility now, but it's much more likely something shows up in the UI).
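For reference, the gate under discussion reduces to something like the sketch below; the "failure-count" key is a hypothetical stand-in, and removing ResetFailureCountOnStartup would amount to taking the reset branch unconditionally:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// resetOnStartup sketches the per-plan gate: today the failure count is
// only cleared on startup when the plan opts in. Removing the field would
// make the reset the default for every plan. Key name is an assumption.
func resetOnStartup(secret *corev1.Secret, resetFailureCountOnStartup bool) {
	if resetFailureCountOnStartup {
		secret.Data["failure-count"] = []byte("0")
	}
}
```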
In v2.10.0, the planner was updated to reattempt the Windows install plans multiple times before marking them as failed, as there can be transient issues that are not representative of a true plan failure. The problem I encountered was that, if a plan failed to apply 3 times but then succeeded, it would only be reattempted two times after the next reboot, as the failure-count would still be equal to 3.
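The arithmetic behind that problem, as a sketch (names illustrative): the retry budget after a reboot is derived from the stale stored count.

```go
package sketch

// remainingAttempts shows why a plan that failed 3 times before eventually
// succeeding only gets maxFailures-3 attempts after the next reboot: the
// stored failure count was never reset on success.
func remainingAttempts(maxFailures, storedFailureCount int) int {
	if remaining := maxFailures - storedFailureCount; remaining > 0 {
		return remaining
	}
	return 0
}
```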
Handling this situation was complicated by an issue in rke2 which resulted in Calico HNS namespaces being deleted each time rke2 was restarted (typically via the one-time instruction). In that case, the plan should not be reattempted; if it was, the node might eventually be marked as available even though some behavior (like deleting pods) would be completely broken. The solution was to introduce this field and conditionally set it based on the cluster's k8s version.

I think we still need to consider that situation. The rke2 fix was delivered in August of last year, so users have had plenty of time to upgrade, but removing this field and changing the default behavior could potentially silently break some existing clusters. I would be in support of doing that for 2.12 and communicating it in the release notes.

The existing change for applied-checksum shouldn't run into the above issue, though.
This PR addresses issue rancher/rancher#48916 by implementing retry logic for one-time instructions.
Changes
- Use retry.OnError for one-time instruction execution (a sketch follows below)
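A minimal sketch of that change, using the k8s.io/client-go/util/retry package the item above names; the backoff values and the retry-everything predicate are placeholders that the actual PR may tune:

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// applyWithRetry wraps a one-time instruction in retry.OnError so that
// transient failures are retried with exponential backoff instead of
// failing the plan outright. Backoff values here are assumptions.
func applyWithRetry(apply func() error) error {
	backoff := wait.Backoff{
		Steps:    3,                      // total attempts
		Duration: 500 * time.Millisecond, // initial delay
		Factor:   2.0,                    // exponential growth
		Jitter:   0.1,
	}
	return retry.OnError(backoff,
		func(err error) bool { return true }, // treat every error as retriable
		apply,
	)
}
```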
Testing
Manually tested.
Logs: